In order to integrate a new feature of on/off target levels of coverage in NGSphy, I needed to understand coverage variation that one could find in a capture experiment. Goals are basically:
Data used comes from Rute’s Hummingbird project. Most of the information reported here about the samples and targets was extracted from the paper.
BAM files corresponding to the mappings of the targeted-sequencing to 2 references.
map2Anna).map2Swift).Workspace triploid (UVigo):
triploid.uvigo.es/home/merly/ research/cph-bam-coverageWorkspace randy (KU):
randy.popgen.dk/home/merly/anna/home/merly/swiftTargeted regions are exons, retrieved a posteriori, due to the fact that the original probes for this project was lost during a flood.
Ingroup reference: Original target files given was a GFF file: Calypte_anna.gene.CDS.2750.gff [1.6M]. This file was converted into a BED file, keeping only chromosome (scaffold), start and end position of the targets.
Outgroup reference:: For this one, I received 2 files. One that had the relation one-to-one orthologs between the genes from Anna and the Swift (48birds_ortholog.list.chi.anna.cpe.hum.finch [656K]). The second, is a GFF file, which contains the exons from Swift (Chaetura_pelagica.CDS.gff[8.5M]). I had to get the targeted regions from the GFF files, taking into account that I needed the genes matching to Anna, and also I needed those genes that were actually used in the Anna’s targets (Calypte_anna.gene.CDS.2750.gff):
48birds_ortholog.list.chi.anna.cpe.hum.finch.Chaetura_pelagica.CDS.gff where codon regions (CDS).Calypte_anna.gene.CDS.2750.gff so I’ll have the same covered genes.Chaetura_pelagica.CDS.gff and keep those regions matching both Anna and Swift.WD="/media/merly/Baymax/research/cph-visit/coverage-analysis/files/"
birds48filename=paste0(WD,"48birds_ortholog.list.chi.anna.cpe.hum.finch")
swiftgff=paste0(WD,"Chaetura_pelagica.CDS.gff")
annagff=paste0(WD,"Calypte_anna.gene.CDS.2750.gff")
birds=read.table(birds48filename, header=T,
colClasses=c("character","numeric","character","character",
"character","character","character"))
swift=read.table(swiftgff,
colClasses=c("character","character","character","numeric",
"numeric","character","character","character","character"))
anna=read.table(annagff,
colClasses=c("character","character","character","numeric",
"numeric","character","character","character","character"))
birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(anna$V9[birds.5$anna %in% anna$V9])
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
swiftGenes=birds.6$swift
swift.2=swift[swift$V3=="CDS",]
swift.3=swift.2[swift.2$V9 %in% birds.6$swift,]
swift.3$size=swift.3$V5-swift.3$V4
swift.4=swift.3[swift.3$size>0,]
anna.2=anna[(anna$V5-anna$V4)>0,]
swift.4=swift.4[,1:(ncol(swift.4)-1)]
write.table(swift.4[,c(1,4,5)],
file=paste0(WD,"targets.swift.2.bed"),
row.names = F,col.names = F, quote = F, sep = "\t")
write.table(swift.4,
file=paste0(WD,"Chaetura_pelagica.CDS.2.gff"),
row.names = F,col.names = F, quote = F, sep = "\t")
write.table(anna.2[,c(1,4,5)],
file=paste0(WD,"targets.anna.2.bed"),
row.names = F,col.names = F, quote = F, sep = "\t")
write.table(anna.2,
file=paste0(WD,"Calypte_anna.gene.CDS.2750.2.gff"),
row.names = F,col.names = F, quote = F, sep = "\t")
write.table(birds.6,
file=paste0(WD,"Calypte_anna.Chaetura_pelagica.ids"),
row.names = F,col.names = F, quote = F, sep = "\t")
This is a quantitative summary description of the resulting target files.
| Number of targets | Size of targeted genome (bp) | Number of genes | Total of scaffolds | |
|---|---|---|---|---|
| Anna | 24,044 | 3,934,867 | 2,750 | 389 |
| Swift | 14,764 | 2,279,626 | 1,469 | 372 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Anna | 104 | 724 | 1,136 | 1,431 | 1,800 | 13,150 |
| Swift | 110 | 843 | 1,281 | 1,552 | 1,943 | 13,000 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Anna | 1 | 90 | 125 | 163.7 | 170 | 5,246 |
| Swift | 1 | 89 | 123 | 154.4 | 166 | 5,240 |
I obtained the coverage of the targeted regions using bedtools (v. 2.22.0), module coverage. With this, and the option -hist I can report a histogram of coverage for each feature in A as well as a summary histogram for _all_ features in A. Output (tab delimited) after each feature in A (from bedtools documentation))
# bases at depthfor bamfile in $(find $HOME/anna/originals -name "*.bam"); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b Chaetura_pelagica.CDS.2.gff | gzip > $HOME/anna/bedtools/cov/${tag}.cov.gz &
done
for bamfile in $(find $HOME/swift/originals -name "*.bam"); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b Calypte_anna.gene.CDS.2750.2.gff | gzip > $HOME/swift/bedtools/cov/${tag}.cov.gz &
done
And so, I filtered the output to keep the coverage per region and the summary histogram separated.
# (i.e)
for tag in $(cat $HOME/anna/files/samples.txt ); do
nohup zcat $HOME/anna/bedtools/cov/H09.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist/H09.nohist.gz &
nohup zcat $HOME/anna/bedtools/cov/H09.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist/H09.hist.gz &
done
for tag in $(cat $HOME/swift/files/samples.txt | tail -n+2 | head -46 ); do
nohup zcat $HOME/swift/bedtools/cov/H09.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist/H09.nohist.gz &
nohup zcat $HOME/swift/bedtools/cov/H09.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist/H09.hist.gz &
done
With the information extracted from the bedtools coverage -hist, we can see the relation between the breadth of coverage obtained and the depth per sample.
| Map2 - target | Coverage per target overview |
|---|---|
| map2Anna | |
| map2Swift |
Zoom: 0x-100x.
| map2Anna | map2Swift |
|---|---|
From the bedtools coverage output I was able to extract the depth of coverage per target, gene and sample. I generated 2 matrices:
Each cell of the matrix correspond to the sum of the number of times each base was covered within the target or the gene. Then, coverage was calculated as the mean value of all the depth of coverage of all the samples for a specific target/gene divided by the size of the corresponding target/gene.
For the depth of coverage for the samples, I summed the coverage of all targets and divided it by the total amount of bases that were targeted.
Note: Outliers presented below were calculated using the following:
qnt <- quantile(data, probs=c(.25, .75))
H <- 1.5 * IQR(data)
y <- data
y[data < (qnt[1] - H)] <- "Outlier"
y[data > (qnt[2] + H)] <- "Outlier"
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 1.977 | 45.55 | 74.34 | 97.06 | 110.5 | 296.9 |
| map2Swift | 1.715 | 38.27 | 62.29 | 83.8 | 98.07 | 265.6 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 4.827 | 63.4 | 93.26 | 99.66 | 126.7 | 4,190 |
| map2Swift | 0.8245 | 54.24 | 81.06 | 85.55 | 111.5 | 702.6 |
# Get the outliers
qnt <- quantile(coveragePerGeneAnna, probs=c(.25, .75))
H <- 1.5 * IQR(coveragePerGeneAnna)
y <- rep("black",length(coveragePerGeneAnna))
outliers=coveragePerGeneAnna
outliers[coveragePerGeneAnna < (qnt[1] - H)] <- -1
outliers[coveragePerGeneAnna > (qnt[2] + H)] <- -1
outliers[outliers>=0]=0
outliers[outliers<0]=1
numOutliers=sum(outliers)
sizeOutlier=coveragePerGeneAnna[coveragePerGeneAnna>4000]
outlier4KIndex=which(coveragePerGeneAnna>4000)
outlier4K=unique(anna.2$V9)[outlier4KIndex]
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 222.5 | 233.7 | 251.9 | 358.9 | 280 | 4190 |
There is one gene with an extremely high coverage ( gene Parent=Aan_R000349; with coverage: 4,190.085), and these are the targets of the selected gene:
cat files/Calypte_anna.gene.CDS.2750.2.gff | grep "Parent=Aan_R000349;" | awk '{ print $0,"\t", $5-$4}'
Scaffold3026 GeneWise CDS 3 230 . + 0 Parent=Aan_R000349; 227
This might be because of the gene size, which has a size of 227 bp and only one target. In addition to that, this gene is present only in the Anna targets. I looked through the coverage of the samples, and selected the one with higher coverage, H1. Afterwards, went to the BAM file of that sample and looked this gene region (for position 3-320) with:
samtools mpileup H1_map2Anna_merge_pair.clean0.sort.bam -r Scaffold3026:3-230
And the first 10 bp are shown up here, but most of the region look similar:
Scaffold3026 3 N 1408 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026 4 N 1536 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Scaffold3026 5 N 1690 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026 6 N 1848 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026 7 N 2110 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026 8 N 2274 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026 9 N 2413 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Scaffold3026 10 N 2652 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026 11 N 2874 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Scaffold3026 12 N 3118 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
qnt <- quantile(coveragePerGeneSwift, probs=c(.25, .75))
H <- 1.5 * IQR(coveragePerGeneSwift)
outliersSwiftHigh=coveragePerGeneSwift
outliersSwiftHigh[coveragePerGeneSwift > (qnt[2] + H)] <- -1
outliersSwiftHigh[outliersSwiftHigh>=0]=0
outliersSwiftHigh[outliersSwiftHigh<0]=1
numOutliersHigh=sum(outliersSwiftHigh)
sizeOutlierHigh=coveragePerGeneSwift[coveragePerGeneSwift > (qnt[2] + H)]
outlierIndex=which(coveragePerGeneSwift > (qnt[2] + H))
outliersGenesSwiftHigh=unique(swift.4$V9)[outlierIndex]
outliersSwiftLow=coveragePerGeneSwift
outliersSwiftLow[coveragePerGeneSwift < (qnt[1] - H)] <- -1
outliersSwiftLow[outliersSwiftLow>=0]=0
outliersSwiftLow[outliersSwiftLow<0]=1
numOutliersLow=sum(outliersSwiftLow)
sizeOutlieLow=coveragePerGeneSwift[coveragePerGeneSwift < (qnt[1] - H)]
outlierIndex=which(coveragePerGeneSwift < (qnt[1] - H))
outliersGenesSwiftLow=unique(swift.4$V9)[outlierIndex]
These are the outliers with high and low coverage, and their coverage distribution. All of the outliers that exist, are outliers with high coverage.
| Number of outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | |
|---|---|---|---|---|---|---|---|
| High Coverage | 19 | 199.1 | 207.3 | 215.7 | 246.4 | 234.3 | 702.6 |
| Low Coverage | 0 | NA | NA | NA | NaN | NA | NA |
I’m interested in those genes with coverage higher than the 3rd. quartile of the outliers with high coverage.
| Parent=Cpe_R001479; | Parent=Cpe_R001695; | Parent=Cpe_R003268; | Parent=Cpe_R011470; | Parent=Cpe_R013115; | |
|---|---|---|---|---|---|
| Coverage | 702.6021 | 265.057 | 261.7742 | 240.9443 | 250.3321 |
| Size | 794.0000 | 145.000 | 137.0000 | 349.0000 | 774.0000 |
I have these scaffolds containing these genes and the number of targets in them:
| Scaffolds | Genes | Coverage | Size | Num.Targets |
|---|---|---|---|---|
| scaffold116 | Parent=Cpe_R001479; | Parent=Cpe_R001479; | 794 | 7 |
| scaffold12 | Parent=Cpe_R001695; | Parent=Cpe_R001695; | 145 | 2 |
| scaffold158 | Parent=Cpe_R003268; | Parent=Cpe_R003268; | 137 | 1 |
| scaffold53 | Parent=Cpe_R011470; | Parent=Cpe_R011470; | 349 | 2 |
| scaffold69 | Parent=Cpe_R013115; | Parent=Cpe_R013115; | 774 | 6 |
I’m looking for a relation between gene size and depth of coverage. So I fit a linear model to the data.
##
## Call:
## lm(formula = genesAnna$Size ~ coveragePerGeneAnna)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1365.3 -705.4 -291.1 361.1 11690.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1474.6592 31.3834 46.989 <2e-16 ***
## coveragePerGeneAnna -0.4395 0.2314 -1.899 0.0577 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1116 on 2748 degrees of freedom
## Multiple R-squared: 0.001311, Adjusted R-squared: 0.0009473
## F-statistic: 3.606 on 1 and 2748 DF, p-value: 0.05766
## Parent=Aan_R000349;
## 10
##
## Call:
## lm(formula = genesAnna$Size[indices] ~ coveragePerGeneAnna[indices])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1397.1 -706.8 -292.7 352.4 11668.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1511.009 47.891 31.551 <2e-16 ***
## coveragePerGeneAnna[indices] -0.812 0.437 -1.858 0.0633 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1116 on 2747 degrees of freedom
## Multiple R-squared: 0.001255, Adjusted R-squared: 0.0008916
## F-statistic: 3.452 on 1 and 2747 DF, p-value: 0.06327
I need to understand how to interpret this
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 0 | 30.75 | 79.58 | 163 | 174.3 | 39,220 |
| map2Swift | 0 | 23.13 | 64.86 | 136.7 | 145 | 56,610 |
I’m looking for a relation between target size and depth of coverage. So I fit a linear model to the data.
##
## Call:
## lm(formula = targets$Size ~ coveragePerTargetAnna)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.5 -72.7 -39.2 4.2 5077.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.451377 1.354405 124.37 <2e-16 ***
## coveragePerTargetAnna -0.029431 0.002159 -13.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 202.8 on 24042 degrees of freedom
## Multiple R-squared: 0.007669, Adjusted R-squared: 0.007628
## F-statistic: 185.8 on 1 and 24042 DF, p-value: < 2.2e-16
I need to understand how to interpret this
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Anna | 0 | 30.75 | 79.58 | 163 | 174.3 | 39,220 |
| Swift | 0 | 23.13 | 64.86 | 136.7 | 145 | 56,610 |
While there is no specific color key for the coverage values, this will only serve as an overview of the coverage per target/gene per sample.
map2Anna |
map2Swift |
|---|---|
map2Anna |
map2Swift |
|---|---|
map2Anna |
map2Swift |
|---|---|
map2Anna |
map2Swift |
|---|---|
>0x.| Genes not captured |
|---|
| Parent=Cpe_R000823; |
| Parent=Cpe_R008410; |
| Parent=Cpe_R008909; |
This is a summary of the target regions that were not captured by a single base, per sample.
Checking whether the targeted regions that were not capture are the same for all the samples. The reason why the regions matched is because the target regions file was generated a posteriori from a list of exons coming from the Chicken genome. It might be possible that:
| Number of common target regions | Percentage (common target/total target regions) | |
|---|---|---|
| Within map2Anna samples | 400 | 1.663617 |
| Within map2Swift samples | 153 | 1.036304 |
| Across datasets | 0 | 0.000000 |
map2Anna target file.map2Swift dataset, due to the lower number of targets in general, even though the percentages are similar (1%)| Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | |
|---|---|---|---|---|---|---|
| Anna | 1 | 95 | 130.5 | 175.9 | 178 | 2,919 |
| Swift | 1 | 93 | 124 | 135.6 | 169 | 525 |
First, is important to define what is an off-target region. For the purpose of this analysis, there are two (2) definition of an off-target region. Also, I’m assuming that the all the possible DNA captured (on/off-target) is defined by a set of scaffolds in the Calypte_anna.gene.CDS.2750.2.gff file.
There are 580 that correspond to ~ 1,089,987,593 bp. So, taking this into account:
map2Anna
Calypte_anna.gene.CDS.2750.2.gff targets.mapt2Swift
Chaetura_pelagica.CDS.2.gff targets.Calypte_anna.gene.CDS.2750.2.gff and the regions in Chaetura_pelagica.CDS.2.gff.
Calypte_anna.gene.CDS.2750.2.gff targets \(\cup\) Chaetura_pelagica.CDS.2.gff targets).Generation of the off-target datasets starts with the identification of the off-target regions. First, obtaining the size of the scaffolds. This has been made by identifying the highest position per scaffold in Calypte_anna.gene.CDS.2750.2.gff and in Chaetura_pelagica.CDS.2.gff, getting with this the sizes of the scaffolds. Afterwards, generating a BED file with the regions (scaffold, startPosition, endPosition) and from which I subtracted both GFF target files. In this way I have the regions that were not targeted, and that are within the range of the data.
The idea is to identify the off-target regions, as well as the size that covers and how is the coverage distribution within this regions compared to the on-target coverage.
To obtain coverage information from the off-target regions:
for bamfile in $(find $originalsAnna -name "*.bam" ); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.anna.bed" | gzip > $HOME/anna/bedtools/cov2/${tag}.cov.gz &
nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.bed" | gzip > $HOME/anna/bedtools/cov3/${tag}.cov.gz &
done
for bamfile in $(find $originalsSwift -name "*.bam" ); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.swift.bed" | gzip > $HOME/swift/bedtools/cov2/${tag}.cov.gz &
nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.bed" | gzip > $HOME/swift/bedtools/cov3/${tag}.cov.gz &
done
# hist split
for tag in $(cat $HOME/anna/files/samples.txt ); do
zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist2/$tag.nohist.gz
zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist2/$tag.hist.gz
zcat $HOME/anna/bedtools/cov3/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist3/$tag.nohist.gz
zcat $HOME/anna/bedtools/cov3/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist3/$tag.hist.gz
done
for tag in $(cat $HOME/swift/files/samples.txt ); do
zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist2/$tag.nohist.gz
zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist2/$tag.hist.gz
zcat $HOME/swift/bedtools/cov3/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist3/$tag.hist.gz
zcat $HOME/swift/bedtools/cov3/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist3/$tag.nohist.gz
done
I followed what has been done for the on-target regions to get this information. Getting how much of the off-target regions was captured, and at which depth.
map2Anna |
map2Swift |
|---|---|
| X-axis: 0-20 | X-axis: 0-100 |
map2Anna dataset, for both off-target sets breadth vs. depth looks the same. Same situation with the mat2Swift dataset.map2Anna, we see that the general depth is low, 25% of the off-target regions are covered at 1x for most of the samples.map2Swift, we see that there are 3 situations:
AnnaBGI and SwiftBGI in the map2Anna datasets, is because these samples were not in that mapping.| Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
|---|---|---|---|---|---|---|---|---|
| map2Anna | Off-target 1 | 0 | 1.518 | 7.4 | 48.29 | 21.89 | 34,600 | 243,841.7 | 493.8033 |
| map2Anna | Off-target 2 | 0 | 0 | 0.4007 | 28.4 | 10.18 | 27,260 | 153,951.500672313 | 392.366538675653 |
| map2Swift | Off-target 1 | 0 | 0.6185 | 2.794 | 34.51 | 9.069 | 37,860 | 337,580.14666756 | 581.016477105048 |
| map2Swift | Off-target 2 | 0 | 0 | 0.1844 | 17.69 | 2.691 | 66,150 | 254,179.229772137 | 504.16190829151 |
This computes depth distribution for every sample and for all samples jointly. Computed wiht ANGSD [angsd version: 0.916 (htslib: 1.3.2) build(May 2 2017 16:21:49)].
angsd -bam bam.filelist -doDepth 1 -out all -doCounts 1 -maxDepth 1000
Output of ANGSD are 2 files:
.depthSample: This file contains nInd number of lines. Column1 is the number sites that has sequencing depth=0,Column2 is the number of sites that has sequencing depth=1 etc
.depthGlobal: The sequencing depth for all samples.
I am trying to analyze if there is any correlation between the depth of coverage and the phylogenetic distance from the reference species, used for the probe generation and mapping.
Information of the phylogenetic reconstruction I got from the Hummingbirds Paper, this says that it was made with mitochondrial genome and nuclear gene trees using either a concatenation approach with RAxML [3] or the multi-species coalescent approach implemented in ASTRAL [4] and ASTRAL-II [5].
The phylogenies were built using subsets of the nuclear genes present the same topology between the main groups of hummingbirds with high confidence. The three subsets correspond to:
This is the Supplementary Table S5 (from the Hummingbirds Paper), describing the subsets selected:
| number of genes | minimum number of sites per species (concatenated) | maximum number of sites per species (concatenated) | includes outgroup | ASTRAL score |
|---|---|---|---|---|
| 741 | 949,470 | 1,542,750 | yes | 87% |
| 1987 | 1,667,071 | 2,783,832 | yes | 78% |
| 2949 | 2,473,664 | 3,792,500 |
Is this really this dataset? weren’t we working with originally 2750 genes?
I’ll be using only the subset for the 2949 genes.
Cpe, outgroup reference.Aan, ingroup reference.Inferred phylogeny
To obtain the distance matrix I used the tree obtained with RAxML, from the concatenation of 2949 genes. The tree gives me the branch lengths in number of substitutions per site. So, the distance calculated here, is the pairwise distance of the tips. This was done with the R package phangorn [7]
sampleColors = colorRampPalette(brewer.pal(9, "Set1"))(48)
treefile="/home/merly/git/cph-visit/coverage-analysis/capture-phylotenetic-decay/trees/RAxML_bipartitions.2011.concat"
tree=read.nexus(treefile)
distMatrix=cophenetic(tree)
dCpe=distMatrix["Cpe",]
dAan=distMatrix["Aan",]
distMatrix[upper.tri(distMatrix)]=NA
Distance matrix
I have done this following the figure F3A in Braggs’s paper [8], where it shows sequencing coverage as a function of genetic divergence. I have calculated the depth of coverage for both datasets and phylogenetic distance, now I have the relationship plot:
##
## Call:
## lm(formula = disCoverageAnna$distance ~ disCoverageAnna$median)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.022120 -0.008747 -0.003240 0.009100 0.038334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.319e-02 2.885e-03 11.504 7.47e-15 ***
## disCoverageAnna$median -1.549e-05 2.942e-05 -0.526 0.601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01253 on 44 degrees of freedom
## Multiple R-squared: 0.006258, Adjusted R-squared: -0.01633
## F-statistic: 0.2771 on 1 and 44 DF, p-value: 0.6013
interpret this
##
## Call:
## lm(formula = disCoverageSwift$distance ~ disCoverageSwift$median)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0072130 -0.0028632 -0.0006428 0.0020115 0.0198364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.262e-01 1.054e-03 119.698 <2e-16 ***
## disCoverageSwift$median -3.490e-06 1.273e-05 -0.274 0.785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.004584 on 44 degrees of freedom
## Multiple R-squared: 0.001705, Adjusted R-squared: -0.02098
## F-statistic: 0.07515 on 1 and 44 DF, p-value: 0.7853
interpret this
References